
fix: recovery from crash during optimization#246

Open
zhourrr wants to merge 5 commits into main from feat/more_ut

Conversation


@zhourrr zhourrr commented Mar 20, 2026

Greptile Summary

This PR adds a crash recovery test for the Optimize operation, introduces a new collection_optimizer helper binary, and bundles several housekeeping fixes: typo corrections (CreateComapctTask → CreateCompactTask, new_segmnet_meta → new_segment_meta, etc.), a parameter rename for clarity (create → create_if_missing), and a relaxation of the floating-point comparison tolerance in Doc::operator==.

Key changes:

  • New collection_optimizer binary that opens a collection and calls Optimize(), used as a subprocess that can be killed to simulate a crash.
  • New OptimizeRecoveryTest::CrashDuringOptimize test that inserts 100 000 documents, crashes the optimizer mid-run, and then verifies the collection can be re-opened and data is intact.
  • Rename of all six CreateComapctTask call-sites (source + tests) to CreateCompactTask.
  • Typo fixes in parameter names and log messages across segment.cc, segment.h, and index_storage.h.
  • Float/double comparison epsilon in doc.cc loosened (1e-6f → 1e-4f for float, 1e-9 → 1e-6 for double).

Issues found:

  • The final data-integrity verification block in CrashDuringOptimize (after the second RunOptimizer call) calls collection->Insert() instead of collection->Fetch() + compare — it inserts documents instead of verifying them, so data corruption would go undetected.
  • write_recovery_test.cc SetUp/TearDown still reference the old crash_test_db path after the variable was renamed to write_recovery_test_db, leaving the test directory uncleaned between runs.
  • collection_optimizer.cc opens the collection with CollectionOptions{false, true} (no segment limit), while the test creates/reopens it with CollectionOptions{false, true, 256 * 1024}, causing a configuration mismatch that may affect whether compaction is triggered.

Confidence Score: 2/5

  • Not safe to merge: the new crash-recovery test has a copy-paste bug that makes the final data-integrity verification a no-op, and the write recovery test has stale cleanup paths.
  • Three P1 bugs in the new test files: (1) the final verification loop inserts documents instead of fetching and comparing them — the test's core correctness guarantee is missing; (2) SetUp/TearDown in write_recovery_test.cc delete the wrong directory, causing test pollution; (3) the collection_optimizer binary uses different CollectionOptions than the test, potentially preventing the optimizer from exercising the intended code paths. The production-code changes (typo renames, epsilon relaxation) are low-risk, but the test infrastructure issues are serious enough to block merging.
  • tests/db/crash_recovery/optimize_recovery_test.cc (final verification loop), tests/db/crash_recovery/write_recovery_test.cc (SetUp/TearDown paths), tests/db/crash_recovery/collection_optimizer.cc (CollectionOptions mismatch)

Important Files Changed

Filename Overview
tests/db/crash_recovery/optimize_recovery_test.cc New crash-recovery test for optimize; has a critical copy-paste bug where the final data-integrity loop calls Insert() instead of Fetch(), a flaky WIFSIGNALED assertion, and commented-out dead code.
tests/db/crash_recovery/write_recovery_test.cc Renamed collection/dir identifiers but SetUp and TearDown still reference the old "crash_test_db" path, leaving write_recovery_test_db uncleaned between runs.
tests/db/crash_recovery/collection_optimizer.cc New optimizer helper binary; opens collection with CollectionOptions{false, true} missing the 256*1024 segment limit used by the test, and is missing std::endl on the final stats output line.
tests/db/crash_recovery/CMakeLists.txt Extracts common library list into CRASH_RECOVERY_COMMON_LIBS variable and adds collection_optimizer build target and dependency; clean refactor.
src/db/collection.cc Fixes three call-sites of the renamed CreateComapctTask → CreateCompactTask; straightforward typo correction.
src/db/index/segment/segment_helper.h Renames CreateComapctTask to CreateCompactTask; simple typo fix with all callers updated.
src/db/index/common/doc.cc Loosens float comparison tolerance from 1e-6f to 1e-4f and double from 1e-9 to 1e-6; may be needed for round-trip precision but is a meaningful behavioral change to Doc equality.

Sequence Diagram

sequenceDiagram
    participant T as Test
    participant C as Collection in-process
    participant O as collection_optimizer subprocess
    participant D as On-disk Collection

    T->>C: CreateAndOpen(dir_path_)
    T->>C: Insert 100k docs
    C->>D: Persist segments
    T->>C: collection.reset()

    T->>O: fork + execvp RunOptimizerAndCrash 4s
    O->>D: Open with CollectionOptions{false,true}
    O->>D: Optimize compaction starts
    T-->>O: SIGKILL after 4s

    T->>C: Open for recovery
    T->>C: Fetch all 100k docs and verify integrity
    T->>C: Insert 50k more docs
    T->>C: collection.reset()

    T->>O: fork + execvp RunOptimizer no crash
    O->>D: Open and Optimize completes cleanly
    O-->>T: exit 0

    T->>C: Open final
    T->>C: Stats assert 150k docs
    T->>C: BUG Insert 150k docs instead of Fetch and verify

Comments Outside Diff (2)

  1. tests/db/crash_recovery/write_recovery_test.cc, line 129 (link)

    P1 SetUp/TearDown reference stale path after rename

    dir_path_ was renamed from "crash_test_db" to "write_recovery_test_db" in this PR, but both SetUp (line 129) and TearDown (line 134) still attempt to delete the old "./crash_test_db" path. As a result:

    • The actual test directory write_recovery_test_db is never cleaned up between test runs, causing test pollution and potential false-positives on repeated runs.
    • The delete of crash_test_db is a no-op since that directory no longer exists.
  2. tests/db/crash_recovery/write_recovery_test.cc, line 134 (link)

    P1 Stale path in TearDown after rename

    Same rename miss as in SetUp. The TearDown must also reference the updated directory name.

Last reviewed commit: "fix: fix some typos"

Greptile also left 5 inline comments on this PR.

Comment on lines +185 to +195
for (int batch = 0; batch < (num_batches + 1000); batch++) {
  std::vector<Doc> docs;
  for (int i = 0; i < batch_size; i++) {
    docs.push_back(CreateTestDoc(batch * batch_size + i, 0));
  }
  auto write_result = collection->Insert(docs);
  ASSERT_TRUE(write_result);
  for (auto &s : write_result.value()) {
    ASSERT_TRUE(s.ok());
  }
}

P1 Final verification loop inserts instead of verifying

The comment on line 178 says "verify data integrity", but the loop body calls collection->Insert() instead of collection->Fetch() + comparing results. This is a copy-paste error from the data population block above. The test will modify the collection instead of validating it, and any data-integrity bug would go undetected here.

Compare this to the correctly-written verification block at lines 142–157, which uses collection->Fetch(pks) and then checks *actual_doc == expected_doc.

Suggested change
for (int batch = 0; batch < (num_batches + 1000); batch++) {
  std::vector<Doc> docs;
  for (int i = 0; i < batch_size; i++) {
    docs.push_back(CreateTestDoc(batch * batch_size + i, 0));
  }
  auto write_result = collection->Insert(docs);
  ASSERT_TRUE(write_result);
  for (auto &s : write_result.value()) {
    ASSERT_TRUE(s.ok());
  }
}
for (int batch = 0; batch < (num_batches + 1000); batch++) {
  for (int i = 0; i < batch_size; i++) {
    Doc expected_doc = CreateTestDoc(batch * batch_size + i, 0);
    std::vector<std::string> pks{expected_doc.pk()};
    if (auto res = collection->Fetch(pks); res) {
      auto map = res.value();
      if (map.find(expected_doc.pk()) == map.end()) {
        FAIL() << "Returned map does not contain doc[" << expected_doc.pk() << "]";
      }
      const auto actual_doc = map.at(expected_doc.pk());
      ASSERT_EQ(*actual_doc, expected_doc)
          << "Data mismatch for doc[" << expected_doc.pk() << "]";
    } else {
      FAIL() << "Failed to fetch doc[" << expected_doc.pk() << "]";
    }
  }
}

Comment on lines +80 to +81
auto result =
zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true});

P1 CollectionOptions mismatch with test configuration

The test file optimize_recovery_test.cc creates and reopens the collection with zvec::CollectionOptions{false, true, 256 * 1024} (a max-doc-per-segment limit of 256 * 1024), but the collection_optimizer binary opens the same on-disk collection with only zvec::CollectionOptions{false, true} (no segment size limit). This mismatch means the optimizer may behave differently than expected during the crash recovery test — for example, it might not trigger segment compaction at the same thresholds, leading to a test that doesn't actually exercise the intended code paths.

Suggested change
auto result =
zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true});
zvec::Collection::Open(config.path, zvec::CollectionOptions{false, true, 256 * 1024});

Comment on lines +91 to +92
ASSERT_TRUE(WIFSIGNALED(status))
<< "Child process was not killed by a signal. It exited normally?";

P2 Flaky ASSERT_TRUE(WIFSIGNALED) when process exits before kill

If the optimizer finishes naturally within the 4-second window, kill(pid, 0) will return -1 (process gone), the kill is skipped, and waitpid will collect a normal-exit status — causing ASSERT_TRUE(WIFSIGNALED(status)) to fail with "Child process was not killed by a signal." The same pattern already exists in write_recovery_test.cc (RunGeneratorAndCrash), but the concern is amplified here because on faster machines the optimize of 2000 * 50 = 100 000 documents might complete quickly.

Consider using a softer assertion (e.g., accept either WIFEXITED or WIFSIGNALED) or asserting that the remaining verification logic still holds regardless of which exit path occurred.

if (s.ok()) {
  std::cout << "Optimize completed successfully" << std::endl;
  // Print final stats
  std::cout << "Final stats: " << collection->Stats()->to_string_formatted();

P2 Missing newline after final stats output

The final stats line has no std::endl or "\n", so the output may be buffered and mixed with subsequent output (e.g., the process's exit trace) without a line break. All other output statements in this file include std::endl.

Suggested change
std::cout << "Final stats: " << collection->Stats()->to_string_formatted();
std::cout << "Final stats: " << collection->Stats()->to_string_formatted() << std::endl;

Comment on lines +197 to +205
// VectorQuery query;
// query.output_fields_ = {"name", "age"};
// query.filter_ = "invert_id >= 6000 and id < 6080";
// query.topk_ = 100;
// std::vector<float> feature(4, 0.0);
// query.query_vector_.assign((const char *)feature.data(),
// feature.size() * sizeof(float));
// query.field_name_ = "dense";
// collection->Query();

P2 Commented-out dead code should be removed

Leaving a commented-out VectorQuery block in the test adds noise without providing value. If this functionality is intended to be tested, it should be enabled; otherwise, it should be deleted.


@zhourrr zhourrr changed the title feat: add crash recovery test for optimize fix: recovery from crash during optimization Mar 20, 2026